Note: This page's design and presentation have been enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 2 • Sub-Lesson 3

🎯 Fine-Tuning, RLHF & Alignment

From raw pre-trained model to helpful AI assistant: the post-training pipeline

What We'll Cover

In the last two sessions we covered how transformers are built and how they are pre-trained on massive text corpora. But a freshly pre-trained model is a strange thing: it has absorbed an enormous amount of human knowledge, yet it would complete your carefully written prompt with something entirely unhelpful—or worse.

This session covers what happens after pre-training: the techniques that turn a raw language model into Claude, ChatGPT, or Gemini. This is also the stage that AI labs are most secretive about—so there is a lot of speculation and partial information in the public record.

We will look at Supervised Fine-Tuning, Reinforcement Learning from Human Feedback (RLHF), Constitutional AI, Direct Preference Optimisation, and finally the practical question every researcher eventually asks: can I fine-tune a model for my own domain?

🧩 The Gap Between Pre-Training and Usefulness

Pre-training produces a model that is extraordinarily good at one thing: predicting the next token. This is not the same as being helpful, honest, or safe.

🔑 What a Base Model Actually Does

If you ask a raw pre-trained model "What is the capital of France?", it might respond: "What is the capital of France? This is a common question asked in geography quizzes. The answer is…" — because that's what the internet looks like. It completes text. It hasn't been taught that your message is a question that deserves a direct answer.

Even more concerning: given a prompt about making a dangerous substance, a base model trained on internet text might simply continue generating whatever plausible text follows such a prompt—because it has no concept of harm, helpfulness, or honesty. It only knows statistical patterns.

The Three Goals of Alignment

Anthropic's original framing (widely adopted across the field) describes the target as three properties that should be balanced:

  • Helpful: Actually useful to the person making a request
  • Harmless: Does not cause harm to the user, third parties, or society
  • Honest: Does not deceive, manipulate, or confabulate

These goals can conflict. A model that refuses everything is harmless but useless. A model that always agrees is helpful in the short term but dishonest. Alignment research is largely about navigating these tensions.

What "Alignment" Means

In AI, alignment refers to the challenge of making a model's behaviour match human values and intentions—not just in cases it was trained on, but in novel situations too.

  • Narrow alignment: Model follows instructions and avoids obvious harms
  • Broader alignment: Model behaves safely even as capabilities increase
  • Why it's hard: Human values are complex, contextual, and sometimes contradictory
  • Why it matters now: As models become more capable, misalignment becomes more consequential

📋 Supervised Fine-Tuning (SFT)

The first step after pre-training is relatively straightforward: show the model many examples of good behaviour and train it to imitate them.

How SFT Works

Supervised Fine-Tuning takes a pre-trained model and continues training it—but now on a much smaller, carefully curated dataset of (prompt, ideal response) pairs.

  • Dataset size: Typically thousands to tens of thousands of examples—tiny compared to pre-training
  • Data source: Written by human contractors or researchers who demonstrate ideal responses
  • Training objective: Same as pre-training (next-token prediction), but on high-quality instruction data
  • Result: Model learns to "act like an assistant"—it begins producing structured, helpful responses rather than raw text continuations

Limitations of SFT Alone

SFT is a powerful first step, but it has an important structural weakness.

  • Imitation, not preference: The model learns to produce responses that look like the training examples—not necessarily responses that are genuinely better
  • Hard to specify everything: You cannot write ideal responses for every possible prompt; the model has to generalise from examples
  • Confident errors: A model fine-tuned on examples of fluent, confident writing may produce fluent, confident errors in situations it hasn't seen
  • Can't compare: SFT doesn't teach the model that response A is better than response B—only that response A is a valid completion

📄 What an Instruction Dataset Looks Like

Modern instruction datasets use a structured format with distinct roles. Here is a simplified example of what a training example might contain:

[System] You are a helpful research assistant. Be concise and accurate.

[User] What is the difference between a p-value and a confidence interval?

[Assistant] A p-value is the probability of observing results at least as extreme as yours if the null hypothesis is true—it is a single number that summarises evidence against the null. A confidence interval gives a range of plausible values for the parameter of interest. They are related but convey different information: a 95% CI that excludes zero is consistent with p < 0.05, but the CI also tells you about the magnitude and precision of the effect, which the p-value does not.

The model is trained to predict the assistant's response given the system prompt and user message. By seeing thousands of such examples, it learns the format, tone, and behaviour expected of an assistant.
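In code, this setup usually comes down to one detail: the model sees the whole conversation, but the loss is computed only on the assistant's tokens. A minimal sketch (the token IDs are illustrative, not from a real tokeniser; `-100` is the PyTorch-style convention for "ignore this position in the loss"):

```python
# Sketch: building one SFT training example with loss masking.
# Token IDs here are made up for illustration.

IGNORE_INDEX = -100  # convention: positions with this label contribute no loss

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    # The model still *sees* the prompt as context, but is only
    # trained to predict the assistant's reply.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

prompt = [101, 7, 42, 9]      # "[System] ... [User] ... [Assistant]"
response = [55, 18, 3, 102]   # the ideal assistant reply
inputs, labels = build_sft_example(prompt, response)
print(labels)  # first four positions masked, remaining four are the response
```

This masking is why SFT teaches the assistant role rather than generic text continuation: gradients flow only through the demonstration of the desired behaviour.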

📄 The InstructGPT Paper — The Landmark SFT + RLHF Study

OpenAI's 2022 paper established the now-standard three-step pipeline (SFT → Reward Model → RL) for instruction following—building on earlier RLHF work on games and summarisation—and showed it made GPT-3 dramatically more helpful with only a small amount of human feedback data.

Training language models to follow instructions with human feedback (Ouyang et al., 2022) — arXiv:2203.02155

🏆 Reinforcement Learning from Human Feedback (RLHF)

RLHF is the technique that transformed language models from clever text completers into the capable, (mostly) well-behaved assistants we use today. It solves the core problem that SFT cannot: how do you teach a model not just to produce good responses, but to prefer better responses over worse ones?

💡 The Core Insight

It is much easier for a human to compare two responses and say "this one is better" than to write the ideal response from scratch. RLHF exploits this: instead of showing the model what good output looks like, we show it which of two outputs is preferred, train a model to predict those preferences, and then use that as a reward signal to improve the language model.

Think of it like training a chef not by handing them recipes, but by letting them cook many dishes and having diners rate them—then using those ratings to guide further cooking.

🔢 The Three-Step RLHF Pipeline

  1. Collect human preference data. The SFT model generates multiple responses to the same prompt. Human annotators compare pairs of responses and select which is better (or rank several). This produces a dataset of (prompt, chosen response, rejected response) triples.
  2. Train a Reward Model (RM). A separate neural network—typically initialised from the SFT model—is trained to predict which response a human would prefer. Given a prompt and a response, the RM outputs a scalar score. This model effectively encodes human preferences in a form the computer can evaluate automatically.
  3. Optimise the LLM with Reinforcement Learning. The SFT model is now the "policy"—it generates responses and receives scores from the reward model. An RL algorithm (typically PPO—Proximal Policy Optimisation) updates the policy to generate higher-scoring responses. A KL-divergence penalty prevents the model from drifting too far from the original SFT model, avoiding incoherent outputs.

Note on PPO: You don't need to understand the maths of PPO to understand RLHF at a conceptual level. The key idea is simply: the model tries out responses, gets scored, and adjusts to score higher—much like how humans learn from feedback.
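Step 2, the reward model, is typically trained with a pairwise (Bradley–Terry style) loss: the score for the chosen response should exceed the score for the rejected one. A minimal numerical sketch, with plain scalars standing in for reward-model outputs:

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the chosen response scores higher, large otherwise."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss only cares about the *margin* between the two scores:
good = pairwise_loss(2.0, 0.0)   # ordering correct -> small loss (~0.13)
bad = pairwise_loss(0.0, 2.0)    # ordering wrong -> large loss (~2.13)
```

Minimising this over many (chosen, rejected) pairs pushes the reward model to assign higher scalar scores to whatever humans tend to prefer—which is exactly the automatic judge the RL stage then optimises against.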

Why RLHF Works

  • Captures nuance: Human preferences encode a vast amount of implicit knowledge about what "good" looks like—tone, accuracy, safety, appropriateness—that is nearly impossible to specify explicitly
  • Generalises: The reward model can score novel prompts that were never in the training set
  • Handles ambiguity: The RL optimisation can discover response strategies that weren't present in the SFT data at all
  • Scalable signal: Pairwise comparison is much faster to collect than writing ideal responses from scratch

Limitations & Failure Modes

  • Reward hacking: The model can find ways to score highly on the reward model without being genuinely better—like a student who learns to game a marking rubric
  • Sycophancy: Models often learn to tell users what they want to hear, since agreeable responses tend to receive higher ratings from annotators
  • Annotation disagreement: Different human annotators have different values and preferences; the reward model averages over these, possibly poorly
  • Expensive: High-quality human preference data from domain experts is costly to collect
  • Alignment tax: RLHF can reduce raw capability on some benchmarks while improving behaviour

📹 A clear explanation of RLHF

🔀 Modern Alternatives & Evolutions

RLHF's complexity—three separate training stages, expensive human annotation, and tricky RL optimisation—has driven researchers to develop cleaner alternatives. These are now used across most frontier models.

📊 Post-Training Approaches Compared

| Approach | Key Idea | Key Benefit | Used In |
| --- | --- | --- | --- |
| RLHF (PPO) | Human preferences → reward model → RL policy optimisation | Flexible; captures nuanced preferences | InstructGPT, early Claude, GPT-4 |
| Constitutional AI (CAI) | Model critiques its own outputs against a written "constitution"; uses AI feedback (RLAIF) instead of human feedback | Scalable; reduces expensive human annotation; makes values explicit | Claude models (Anthropic) |
| DPO (Direct Preference Optimisation) | Reformulates preference learning as a classification problem on the language model itself; no separate reward model or RL loop needed | Much simpler training; stable; similar performance to RLHF | Zephyr, Llama-3-Instruct, many open models |
| GRPO / ORPO | Variants that drop the separate value model (GRPO) or fold the supervised and preference objectives into a single reference-free loss (ORPO) | Further memory savings; fewer models in memory or a single training stage | DeepSeek-R1, Qwen models |

🏛️ Constitutional AI — Why It's Worth Understanding

Constitutional AI (CAI), developed by Anthropic, is the approach used to train Claude—the AI that helped design these course pages. It works in two stages:

  • Supervised stage: The model is shown its own potentially harmful response and asked to critique it against a list of principles (the "constitution"), then revise the response. This generates supervised training data without human annotation of every example.
  • RL stage: A separate AI model (rather than humans) scores responses against the constitution. This "AI feedback" (RLAIF) replaces the expensive human preference collection step.

The key advantage is that the values guiding alignment are made explicit in the constitution, rather than being implicitly encoded in annotator preferences. This also makes the process more auditable.

📄 Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022 — Anthropic)

arXiv:2212.08073 — the original paper introducing Constitutional AI. Accessible even without a deep ML background.

Direct Preference Optimisation (DPO) (Rafailov et al., 2023) — arXiv:2305.18290. The paper that made RLHF significantly simpler for practitioners.
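The core of DPO fits in a few lines. Instead of training a reward model, it compares the policy's log-probabilities for the chosen and rejected responses against a frozen reference (SFT) model, and applies the same -log sigmoid shape directly. A sketch with illustrative numbers:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective for one preference pair.
    logp_* are total log-probabilities of each full response under the
    trainable policy; ref_logp_* are the same quantities under the
    frozen reference (SFT) model. beta controls how far the policy may
    drift from the reference."""
    # Implicit "rewards": how much more likely each response has become
    # under the policy, relative to the reference model.
    chosen_margin = beta * (logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (logp_rejected - ref_logp_rejected)
    # Same -log sigmoid form as reward-model training, but applied
    # directly to the language model: no separate RM, no RL loop.
    diff = chosen_margin - rejected_margin
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# A policy that has raised the chosen response's likelihood (relative to
# the reference) gets a lower loss than one that has done the opposite:
good = dpo_loss(-10.0, -14.0, -12.0, -12.0)
bad = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

The elegance is that the gradient of this loss updates the language model itself—the "reward model" exists only implicitly, which is why the whole three-stage RLHF pipeline collapses into one classification-style training run.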

🔧 Fine-Tuning for Researchers: PEFT & LoRA

You are unlikely to train a frontier model from scratch. But you may well want to adapt an existing model to your domain—whether that's genomics, legal text, historical archives, or clinical notes. Parameter-Efficient Fine-Tuning (PEFT) methods make this tractable even on modest hardware.

The Researcher's Dilemma

General-purpose models like Claude or GPT-4 are excellent at a wide range of tasks. But for specialised domains, they may lack specific vocabulary, output formats, or reasoning patterns. Fine-tuning lets you take an existing model's general capabilities and specialise them—without paying the enormous cost of pre-training from scratch.

The challenge: a 7B parameter model has 7 billion weights. Updating all of them requires GPU memory and compute that most researchers don't have. PEFT methods solve this by updating only a small fraction of the parameters.

Full Fine-Tuning

Update all parameters of the pre-trained model on your domain data.

  • Best performance: Can maximally adapt the model to new domain
  • Expensive: Requires the same GPU memory as pre-training a model of that size
  • Catastrophic forgetting: Model can lose general capabilities if training data is narrow
  • Practical: Only realistic for models up to a few billion parameters on consumer hardware; larger models require significant cloud compute

LoRA (Low-Rank Adaptation)

The dominant approach for researcher-scale fine-tuning. Instead of updating all weights, inject small trainable matrices alongside the existing attention weights.

  • Only ~0.1–1% of parameters updated: Dramatically lower memory and compute
  • Near-equivalent performance: For most domain adaptation tasks, LoRA matches full fine-tuning
  • Composable: LoRA adapters can be merged, swapped, or shared—enabling collaborative fine-tuning
  • Practical: A 7B model can be LoRA fine-tuned on a single consumer GPU in hours
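The arithmetic behind those numbers is simple. LoRA freezes the original weight matrix W and learns a low-rank correction BA (with B initialised to zero, so training starts from the base model's behaviour exactly). A sketch with numpy and toy dimensions, not real model sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 1024, 8                       # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))          # frozen pre-trained weight: d*d params
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection (r x d)
B = np.zeros((d, r))                 # trainable up-projection, init to zero
alpha = 16                           # LoRA scaling hyperparameter

def lora_forward(x):
    # Original frozen path plus the low-rank correction. Because B = 0
    # at initialisation, the adapted model starts out behaving exactly
    # like the frozen base model.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

trainable = A.size + B.size          # 2 * d * r parameters
frozen = W.size                      # d * d parameters
print(f"trainable fraction: {trainable / frozen:.2%}")  # prints 1.56%
```

Note how the trainable fraction scales: 2dr versus d², so at realistic hidden sizes (d in the thousands) a rank of 8–64 lands in the ~0.1–1% range quoted above. In practice you would use a library such as Hugging Face PEFT rather than writing this by hand.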

Prompt Tuning & Prefix Tuning

Even lighter-weight alternatives: instead of modifying model weights at all, learn a set of "soft tokens" that are prepended to every input.

  • No weight changes: The base model is completely frozen; only the soft prompt vectors are trained
  • Extremely lightweight: Only a few thousand parameters to learn
  • Limited expressiveness: Less powerful than LoRA for significant domain shifts
  • Easy to switch: Swap soft prompts to change the model's domain without reloading weights
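Mechanically, prompt tuning just prepends a block of trainable embedding vectors to the frozen token embeddings of every input. A sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, n_soft = 64, 10                   # embedding size, soft-prompt length (toy)
soft_prompt = rng.normal(size=(n_soft, d_model)) * 0.1  # the ONLY trainable params

def embed_with_soft_prompt(token_embeddings):
    """Prepend learned soft tokens to frozen input embeddings.
    The base model's weights are never touched; during training only
    `soft_prompt` receives gradient updates."""
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

tokens = rng.normal(size=(20, d_model))    # embeddings of a 20-token input
full_input = embed_with_soft_prompt(tokens)
print(full_input.shape)                     # (30, 64): 10 soft + 20 real tokens
print(soft_prompt.size)                     # 640 trainable parameters in total
```

Swapping domains then really is just swapping the `soft_prompt` array—the frozen model never needs to be reloaded, which is what makes the "easy to switch" property above fall out for free.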

🤔 When Should You Fine-Tune vs Just Prompt?

This is a practical question researchers often face. The answer is almost always: try prompting first.

| Situation | Recommendation | Why |
| --- | --- | --- |
| General tasks with good instructions | Prompt engineering | Faster, cheaper, easier to iterate; frontier models are powerful with good prompts |
| Consistent structured output format needed | Fine-tuning | Reliable JSON/table/annotation outputs across many examples |
| Highly specialised vocabulary or reasoning | Fine-tuning | Domain-specific terms, notation, or reasoning patterns not in pre-training data |
| Running at scale with cost constraints | Fine-tune a smaller model | A fine-tuned 7B model can match a prompted 70B model at 10× lower inference cost |
| Only a handful of examples available | Few-shot prompting | Fine-tuning on <50 examples often leads to overfitting; few-shot prompting is safer |
| Want to remove certain behaviours | Fine-tuning (with care) | Prompting cannot reliably suppress capabilities; fine-tuning can shift defaults |

📄 LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)

arXiv:2106.09685 — the original LoRA paper. Highly readable and has become one of the most cited papers in applied NLP.

Hugging Face PEFT Library — practical documentation covering LoRA, prefix tuning, and other PEFT methods. A good starting point if you want to try this yourself.

🔬 Practical Implications for Researchers

Understanding post-training helps you use AI tools more effectively and interpret their behaviour more accurately.

Why Models Sometimes Refuse

When a model declines to help with something, this is almost always a product of alignment training—not a fundamental limitation of what the model can do.

  • Refusals are trained behaviour: The underlying pre-trained model likely has the relevant capability; the post-training process has reduced the probability of certain outputs
  • System prompts: When you use a model via an API or product, a hidden "system prompt" is usually prepended that sets the model's persona, constraints, and defaults
  • Context matters: The same model will behave differently depending on the system prompt—a medical API deployment behaves differently from a general consumer product
  • Over-refusal is a known problem: Alignment training can make models too cautious; labs actively try to calibrate this balance

Open vs Closed Models for Research

A key practical decision is whether to use a closed API (Claude, GPT-4) or an open-weights model (LLaMA, Mistral).

  • Closed (API): No setup; frontier capability; data privacy questions; no fine-tuning control; ongoing costs; subject to policy changes
  • Open-weights: Can run locally; full fine-tuning control; data stays on your infrastructure; requires compute; smaller models than frontier closed models
  • For sensitive data: Open models running locally may be required if data cannot leave your institution
  • Key open models: Meta's Llama 3, Mistral, Qwen, Gemma — all have research-friendly licences

The Alignment Tax

Post-training improves behaviour but can reduce some raw capabilities—a trade-off researchers should be aware of.

  • Base vs instruct models: Many open model releases include both a raw base model and an instruction-tuned variant; they have different strengths
  • Benchmark performance: Base models sometimes score higher on specific benchmarks than their aligned counterparts—alignment optimises for helpfulness/safety, not benchmark scores
  • Sycophancy in research use: RLHF-trained models may agree with incorrect statements if the user seems confident—a genuine risk in research workflows requiring critical evaluation
  • Calibration: Aligned models sometimes express more confidence than is warranted; always verify factual claims independently

💡 A Note on Sycophancy in Research

Sycophancy—the tendency of aligned models to agree with the user rather than provide accurate information—is one of the most practically important alignment failure modes for researchers. If you present a hypothesis to an AI assistant and it agrees enthusiastically, that agreement may reflect your framing rather than truth. Techniques to mitigate this: ask the model to steelman opposing views, explicitly prompt for criticism, or present the hypothesis without indicating your own position first.

📚 Summary & Key Takeaways

You now have a picture of the full training pipeline that produces modern AI assistants:

  • Base models aren't assistants: Pre-training produces a powerful text predictor, not a helpful, safe conversational agent
  • SFT teaches format and instruction-following: High-quality (prompt, response) pairs teach the model to behave like an assistant, but via imitation
  • RLHF captures nuanced preferences: Human comparisons → reward model → RL optimisation produces models that are genuinely preferred by users, not just fluent
  • CAI and DPO offer cleaner alternatives: Constitutional AI makes values explicit; DPO removes the need for a separate RL loop
  • LoRA makes fine-tuning accessible: Researchers can adapt existing models to specialist domains without frontier-scale compute
  • Understanding post-training helps: Refusals, sycophancy, and differences between models largely reflect alignment choices, not raw capability

Next session (Week 3): We zoom out from model internals to consider the broader context—the environmental implications of AI. Training and deploying large models consumes enormous amounts of energy. What does the carbon footprint of AI actually look like, and what does sustainable AI practice mean for researchers?